Cocojunk

Polyglot (computing)

Published: Sat May 03 2025 19:23:38 GMT+0000 (Coordinated Universal Time) Last Updated: 5/3/2025, 7:23:38 PM

The Forbidden Code: Polyglots - Speaking in Tongues to Computers

Welcome to the underground. Here, we explore techniques that push the boundaries of what computers and code are supposed to do. Forget the clean, predictable world of single-language projects and standard file formats. We're diving into the fascinating, sometimes illicit, world of polyglots – single files designed to be valid and functional in multiple languages or formats simultaneously.

This isn't just about clever tricks; it's about understanding how interpreters and parsers work at a fundamental level – examining the stream of bytes and finding ways to whisper different instructions to different listeners from the same script. This technique can be used for surprising compatibility, but it's often found lurking in the shadows, used for obfuscation, challenging security systems, or hiding malicious payloads in plain sight.

What is a Polyglot?

Let's start with a clear definition, just like they might give you in school, before we twist it for our purposes:

In computing, a polyglot is a computer program, script, or other file written in a valid form of multiple programming languages or file formats. The name is analogous to multilingualism, referring to the ability to speak many languages. A polyglot file is composed by combining syntax from two or more different formats or languages.

Think of it like a secret agent leaving a message that looks like an innocent shopping list to one person but contains a coded operational plan when read by someone else who knows the cipher. The "cipher" here is the specific syntax and parsing rules of different computer languages or file formats.

When the polyglot is designed to be interpreted as source code, it's specifically called a polyglot program. However, the core principle applies to any file type. The key insight is that all files are ultimately just streams of bytes. Different programs read these bytes and interpret them according to their own rules (their syntax, format specifications, etc.). A polyglot works by arranging these bytes in a way that satisfies the interpretation rules of more than one program or language simultaneously.

This duality is powerful. It can bridge compatibility gaps or, more interestingly from an "underground" perspective, it can allow a file to pass validation checks for one format while containing hidden instructions or data for another.
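To make the "same bytes, different readers" idea concrete, here is a minimal sketch that reads one four-byte sequence in three ways. The byte string and variable names are chosen purely for illustration.

```python
import struct

# Four bytes; they happen to be the start of a GIF signature.
data = b"GIF8"

# Reader 1: treat the bytes as ASCII text.
as_text = data.decode("ascii")

# Reader 2: treat the same bytes as one little-endian 32-bit unsigned integer.
as_int = struct.unpack("<I", data)[0]

# Reader 3: treat them as four separate byte values.
as_list = list(data)

print(as_text, hex(as_int), as_list)
```

None of the three readings is the "true" one; each follows from the rules the reader applies, which is the property a polyglot relies on.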

The Craft of Construction: Building a Multi-Lingual File

How do you get a single sequence of bytes to make sense to different interpreters? It's a meticulous process of leveraging the specific parsing rules of the target formats/languages. The main techniques involve:

Exploiting Comments: Many languages use specific characters or sequences to mark comments (code that should be ignored by the interpreter). By hiding code for one language within what is a comment in another, you can effectively make parts of the file invisible to certain parsers.
Leveraging Different Syntax Rules: Some characters or sequences have different meanings depending on the language. A character that starts a command in one might be part of a string literal in another.
Conditional Interpretation: Sometimes, syntax can be combined such that one interpreter reads it one way (perhaps defining a function or variable), while another reads it differently (maybe as a series of operations or just ignored characters).
Structure Compatibility: File formats often have headers, footers, or specific sections. A polyglot can sometimes work by ensuring the header satisfies one format's requirements while the rest (or parts of the rest) satisfy another, or by strategically placing different data within segments ignored by one format but read by another (like comment fields or padding).

The primary challenge is ensuring that the syntax/data intended for one interpreter doesn't cause errors or unexpected behavior when parsed by another. This often means clever use of comments to "wrap" code or data.
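The comment-wrapping idea can be sketched with two deliberately crude stand-ins for real parsers. The functions c_view and shell_view below are hypothetical approximations written for this illustration (a real C compiler and a real shell do far more); they only show which parts of one text each kind of reader would skip.

```python
import re

# One text, intended for two readers.
source = (
    "#define a /*\n"
    'echo "hello from the shell"\n'
    "exit\n"
    "#*/\n"
    "int main(void) { return 0; }\n"
)

def c_view(text):
    # Rough approximation of a C compiler's early step:
    # replace every /* ... */ block comment with a single space.
    return re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)

def shell_view(text):
    # Rough approximation of a shell: skip lines that start with '#'
    # and stop reading once a bare `exit` command is reached.
    seen = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        seen.append(line)
        if line.strip() == "exit":
            break
    return seen

print(repr(c_view(source)))
print(shell_view(source))
```

In this sketch the shell commands vanish from the C view because they sit inside a block comment, while the comment delimiters themselves sit on lines the shell view discards.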

Example: A Polyglot in C, PHP, and Bash

Let's look at a small example demonstrating some of these techniques: a single file designed to run as a C program, a PHP script, and a Bash shell script. It is a minimal sketch, deliberately simpler than the polyglots found in the wild, and it assumes the file is compiled as C (for example, gcc -x c), run with the PHP command-line interpreter, or passed to bash as a script.

#define a /*
#<?php
echo "Hello from PHP or Bash!";
exit;
#?>
#*/
#include <stdio.h>
int main(void) {
  printf("Hello from C!\n");
  return 0;
}

Let's break down how this file works for each interpreter:

As a C Program:
- The first line, #define a /*, is a preprocessor directive, but the /* on it opens a block comment. Comments are stripped before directives are processed, so everything up to the */ on the line #*/ disappears and the directive collapses to a harmless #define a.
- The PHP and Bash code sits entirely inside that comment, so the compiler never sees it.
- #include <stdio.h> and int main(void) { ... } are ordinary C.
- Result: Compiles and runs the C code, printing "Hello from C!".
As a PHP Script:
- PHP only interprets what lies between <?php and ?>. Anything outside the tags is copied to the output verbatim, so the text before the opening tag (the first line and the # that starts the second) is printed as-is rather than treated as a comment.
- The <?php tag takes effect even though, to C and Bash, it sits on a commented-out line.
- echo "Hello from PHP or Bash!"; prints the message, and exit; stops the script.
- On the line #?>, the # starts a PHP comment, but a one-line comment ends at a closing tag, so ?> still closes the PHP block. The C code that follows is plain text to PHP rather than code it has to parse, and exit has already stopped execution before that text would be printed.
- Result: Prints "Hello from PHP or Bash!", preceded by the stray text that appeared before the opening tag. The interpretation is valid, even though the output is not identical to the other two.
As a Bash Script:
- Lines starting with # are comments in Bash, so #define a /* and #<?php are ignored.
- echo is a shell builtin, and the trailing semicolon is an ordinary command separator.
- exit; ends the script. Bash reads a script command by command, so it never reaches the C code below.
- Result: Prints "Hello from PHP or Bash!".

More elaborate polyglots of this kind layer on further overlaps. // is a comment in C and PHP but names the root directory in Bash, where the resulting error can be silenced with shell redirection. function main() is valid in both PHP and Bash and can be turned into int main(void) by C macros. if (($x)) is valid in both Bash and PHP. printf is a Bash builtin that matches C's printf apart from the parentheses, which the C preprocessor can supply.

This example highlights how comments, characters that mean different things in different languages (#, /*, ?>), and execution flow (exit) are combined to create a single file readable by multiple systems.

Other simple techniques include:

Placing one format's header at the start and another's data/footer at the end (common for formats like GIF/ZIP/JAR).
Using padding (null bytes) to create sections ignored by one parser where another's data can be hidden (Cavities).
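The header-at-the-start, footer-at-the-end layout can be sketched with Python's standard zipfile module. This is an illustration under stated assumptions: GIF_BYTES is a byte string commonly used as a tiny 1x1 GIF, but the sketch only checks its signature and does not decode the image, and the member name note.txt is made up for the example.

```python
import io
import zipfile

# Bytes commonly used as a tiny 1x1 GIF. Only the 6-byte signature is
# checked below; this sketch does not decode the image.
GIF_BYTES = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\xff\xff\xff\x00\x00\x00"
    b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00"
    b"\x02\x02D\x01\x00;"
)

def build_zip(name, content):
    """Create a ZIP archive in memory holding a single member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, content)
    return buf.getvalue()

# Stack: image bytes first, archive bytes appended after them.
stacked = GIF_BYTES + build_zip("note.txt", "hidden in plain sight")

# A header-based reader sees a GIF signature at offset 0 ...
starts_like_gif = stacked.startswith(b"GIF89a")

# ... while a ZIP reader locates the archive from the end of the data.
with zipfile.ZipFile(io.BytesIO(stacked)) as zf:
    names = zf.namelist()
    recovered = zf.read("note.txt").decode("utf-8")

print(starts_like_gif, names, recovered)
```

The two readers never disagree about the bytes; they simply start looking from opposite ends.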

A Brief History in the Shadows

Polyglots aren't new. Like many clever low-level tricks, they have roots in hacker culture:

Early Puzzles and Curios: Polyglots were crafted as intellectual challenges and demonstrations of deep technical understanding starting at least in the early 1990s. They were shared on platforms like Usenet groups, showcasing the programmer's ability to bend the rules of multiple languages.
Obfuscated Code Contests: The International Obfuscated C Code Contest (IOCCC) often features entries that explore language boundaries and surprising code behaviors, and polyglots have been winning entries.
Malware and Covert Channels: In the 21st century, the utility of polyglots for hiding information became apparent for less benign purposes. They could be used to smuggle malicious code inside seemingly harmless files or create covert communication channels that bypass standard filtering.

Underground Typologies: Different Flavors of Polyglots

Polyglots can be categorized based on how the different formats or languages are layered or combined:

Stacks: Simple concatenation where one file format follows another. Less common for true polyglots where the entire file is valid in both, but can work if one format has a very flexible footer or end-of-file marker, and the other has a flexible header.
Parasites: A secondary file format or malicious payload is hidden within comment fields or other ignored sections of a primary, ostensibly benign file format. The "parasite" is carried by the "host."
Zippers: Similar to parasites, but two files are mutually arranged within each other's comment fields or ignored sections, creating a kind of interwoven structure.
Cavities: A secondary file format or data is hidden within null-padded or unused areas of the primary file structure. This technique often requires detailed knowledge of the primary file format's layout and potential empty spaces.
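As a small, benign sketch of the parasite pattern, the ZIP format has an archive-level comment field that extraction tools generally ignore. The example below stores a stand-in payload there using Python's zipfile module; the payload bytes and the member name are hypothetical.

```python
import io
import zipfile

# Hypothetical stand-in for the data being carried by the host file.
payload = b"#!/bin/sh\necho parasite\n"

# Build the host: an ordinary archive whose comment field carries the payload.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("readme.txt", "an ordinary archive member")
    zf.comment = payload  # written into the end-of-central-directory record
host = buf.getvalue()

# Reading the host back: the member list is unaffected by the comment,
# but the payload is still there for anyone who looks in that field.
with zipfile.ZipFile(io.BytesIO(host)) as zf:
    members = zf.namelist()
    carried = zf.comment

print(members, carried)
```

A zipper arrangement would go one step further and place the host inside an ignored region of the second format as well; that is not attempted here.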

Legitimate Applications (Even the Underground Has Rules)

While often associated with clever exploits, polyglot techniques also have valid uses:

Polyglot Markup (HTML5 and XHTML): A single document is written so that it is valid when parsed according to both the HTML5 specification and the XML specification, an approach described in a W3C note on polyglot markup. This allows serving the same file with different MIME types (text/html or application/xhtml+xml) while producing the same document structure (DOM) under either parser. Key rules for achieving this include a specific DOCTYPE declaration, well-formed XML syntax (quoted attributes, self-closing tags for void elements like <br/>), and avoiding XML processing instructions.
Composing Formats: Some file formats are designed with polyglotting in mind. The DICOM (medical imaging) format, for instance, allows combining its structure with TIFF, meaning a single file can be viewed by either a DICOM viewer or a TIFF viewer.
Compatibility Layers: In some cases, polyglots are used to bridge compatibility between different versions of a language. A single Python script might be written using a subset of syntax valid in both Python 2 and Python 3, allowing it to run on systems with either interpreter.
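For the Python 2 and Python 3 case, a minimal sketch of the "common subset" style looks like the following. The names text_type and describe_interpreter are made up for the example; the point is only that every construct used is accepted by both major versions.

```python
import sys

# `unicode` exists only on Python 2; on Python 3 the lookup raises NameError
# and str (already a Unicode type) is used instead.
try:
    text_type = unicode
except NameError:
    text_type = str

def describe_interpreter():
    # sys.version_info can be indexed on both major versions.
    return "running under Python %d" % sys.version_info[0]

# print with a single parenthesised argument is valid in both versions.
print(describe_interpreter())
```

Unlike the comment tricks used elsewhere, nothing is hidden here: both interpreters execute every line, and the code simply avoids syntax that only one of them accepts.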

These legitimate uses demonstrate that the core principle – making a file understandable to multiple systems – isn't inherently malicious, but the way it's used often determines its "forbidden" status.

The Dark Side: Security Implications and Exploits

Here's where polyglots truly enter the realm of forbidden code. Their ability to be interpreted differently is a golden opportunity for bypassing security controls, hiding malicious payloads, and exploiting vulnerabilities in parsing engines.

Bypassing File Type Checks: Many security systems check a file's type based on its extension or a simple check of the header (like the first few bytes, known as a "magic number"). A polyglot can be crafted to have a valid header and extension for a seemingly harmless type (like a JPEG image or a PDF document) while containing executable code or malicious data that another system (like a vulnerable image renderer or a specific processing library) will interpret and execute. This is a form of steganography, hiding data within another file.
Exploiting Parser Differences: The root cause of many polyglot-based exploits is that different software parsers for the same file format might have slightly different levels of strictness, handle errors differently, or have varying interpretations of ambiguous sections. A file can be crafted to be benign according to a strict validator but malicious according to a more lenient or buggy parser.
SQL Injection as a Trivial Polyglot: While not a file polyglot, SQL Injection is a classic example of the principle of polyglotting in a command context. User-provided input (expected to be just data) is crafted to contain valid SQL syntax that changes the meaning of the command being built by the application. The application intends to interpret the input as data; the database interpreter sees it as code.
Flexible File Formats: Formats known for their complexity or flexibility (like PDF or image formats with extensive metadata/comment fields) offer more surface area for polyglotting. For example, while the PDF specification says the %PDF magic number should be at the very beginning, many parsers are tolerant and will accept it anywhere within the first 1,024 bytes. This leaves space at the start to embed other data, allowing a file to be, say, a valid Bash script and a valid PDF. The PDF format is notoriously complex, enabling even PDF-PDF polyglots that render completely different content depending on the specific PDF reader used.
Detection Challenges: Standard antivirus and intrusion detection systems often rely on signatures or simple file type identification. Polyglots can evade these by appearing as one type while containing malicious code for another, or by hiding the malicious payload in sections the standard tools don't scrutinize for that file type. Detecting polyglot malware requires deeper analysis that understands the parsing rules of multiple formats simultaneously.
Case Study: PE-DICOM Polyglots: A particularly insidious example involved crafting files that were valid DICOM medical images and valid Windows Portable Executables (PE), which is the format for executables and DLLs. When viewed by a medical system, it was a standard image. When processed by a susceptible system component or if executed, it ran malicious code. This created unique challenges for incident response, as deleting the file meant deleting patient health information, essentially fusing the malware to sensitive data.
Case Study: The GIFAR Attack: This is a classic polyglot exploit targeting web applications. A GIFAR is a single file that is simultaneously a valid GIF image and a valid Java Archive (JAR) file (which is based on the ZIP format). How? GIF places its header at the beginning, while ZIP/JAR places its central directory and end-of-archive records at the end. You can craft a file that starts with a GIF header (making it a valid image) and ends with ZIP structures (making it a valid JAR). An attacker could upload this file to a website allowing image uploads. If the site or a client (like a web browser with a vulnerable Java plugin) later tried to process the file as a JAR (perhaps due to a vulnerability related to file handling or same-origin policies), the embedded malicious Java code could be executed, often with the authority of the website itself. This specific attack was patched, but it highlights the fundamental vulnerability: a system treating a file as one thing while it secretly contains instructions for another.
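The file-type-check and GIFAR points above can be illustrated from the defender's side. The sketch below contrasts a magic-number-only check with a check that asks more than one parser about the same bytes. The function names are hypothetical, only two formats are considered, and the "GIF" here is a signature followed by filler rather than a real image.

```python
import io
import zipfile

def naive_is_gif(data):
    """Magic-number check only: trusts the first six bytes."""
    return data[:6] in (b"GIF87a", b"GIF89a")

def claimed_formats(data):
    """List every format (of the two checked here) that accepts the data."""
    found = []
    if naive_is_gif(data):
        found.append("gif")
    if zipfile.is_zipfile(io.BytesIO(data)):
        found.append("zip")
    return found

def make_zip():
    """Build a small in-memory archive with one placeholder member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("Payload.class", b"placeholder, not real bytecode")
    return buf.getvalue()

plain_gif_header = b"GIF89a" + b"\x00" * 32    # stand-in, not a full image
suspect_upload = plain_gif_header + make_zip()  # GIF-style start, ZIP-style end

print(claimed_formats(plain_gif_header))
print(claimed_formats(suspect_upload))
```

An upload filter built on naive_is_gif alone would accept both inputs. Flagging any upload that more than one format claims is one possible heuristic, though real detection has to cover many more formats and parser quirks.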

Beyond the File: Related Concepts

While we've focused on polyglots as single files, the concept of using multiple "languages" or systems together extends further:

Polyglot Programming: This term refers to the practice of building an entire system or application using multiple programming languages. Different components might be written in languages best suited for their task (e.g., performance-critical parts in C++, web frontends in JavaScript, backend logic in Python). This doesn't necessarily involve polyglot files, but rather different language compilers/interpreters interacting within a larger architecture.
Polyglot Persistence: Similar to polyglot programming, this refers to using multiple different types of data stores (databases) within a single application based on the nature of the data (e.g., relational database for structured data, NoSQL document store for flexible data, graph database for relationships).

In Conclusion

Polyglots represent a powerful, albeit often underground, technique rooted in a deep understanding of how different software interprets sequences of bytes. From clever coding challenges and bridging compatibility gaps to serving as covert channels and vectors for sophisticated malware, polyglots demonstrate that the apparent file type or language is just one layer of meaning. By mastering how to layer different syntactic and structural rules within a single file, one gains the ability to communicate different messages to different systems simultaneously – a true forbidden code technique.